4f87658ef0de194413056248a00ce009-AuthorFeedback.pdf
However, lowering training loss may cause overfitting, especially when training data is scarce. In contrast, ARML is guaranteed to find a good prior, so that the least data is required to find the parameter which generalizes the best. Ideally, ARML can discard a harmful task by lowering its weight to 0. Our true objective is to find the optimal task weights α*.
When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents
Hadeliya, Tsimur, Jauhar, Mohammad Ali, Sakpal, Nidhi, Cruz, Diogo
Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. Newer LLMs offer longer context windows and support tool-calling capabilities. Prior work has focused mainly on evaluating LLMs on long-context prompts, leaving the agentic setup relatively unexplored from both capability and safety perspectives. Our work addresses this gap. We find that LLM agents can be sensitive to the length, type, and placement of context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano's increases from ~5% to ~40% while Grok 4 Fast's decreases from ~80% to ~10% at 200K tokens. Our work reveals potential safety issues with agents operating on longer contexts and raises additional questions about current metrics and paradigms for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.
- North America > United States > New Jersey (0.04)
- Europe > United Kingdom (0.04)
Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation
Hahm, Dongyoon, Min, Taywon, Jin, Woogyeol, Lee, Kimin
Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results demonstrate that PING significantly enhances the safety of fine-tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes reveals that prefix tokens are crucial for behavior modification, explaining the performance gains. WARNING: This paper contains contents that are unethical or offensive in nature.
- North America > United States (0.04)
- Asia > South Korea (0.04)
- Information Technology > Security & Privacy (1.00)
- Law (0.68)
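The prefix-selection loop described in the PING abstract above (generate candidate prefixes, then keep those that balance task performance and refusal behavior) could be sketched as follows. This is an illustrative sketch, not the authors' implementation: `task_score`, `refusal_score`, the utility weighting `alpha`, and the toy scorers are all assumptions standing in for benchmark evaluation of the fine-tuned agent.

```python
def select_prefix(candidates, task_score, refusal_score, alpha=0.5):
    """Pick the candidate prefix with the best trade-off between
    preserved task performance and refusal of harmful requests."""
    def utility(prefix):
        return alpha * task_score(prefix) + (1 - alpha) * refusal_score(prefix)
    return max(candidates, key=utility)

def respond(agent, prompt, prefix):
    """Prepend the selected prefix so the agent's response starts from it."""
    return prefix + " " + agent(prompt)

# Toy usage with illustrative scorers.
candidates = [
    "Before acting, I will check whether this request is safe.",
    "Proceeding with the task as given.",
]
task_score = lambda p: 1.0                        # both preserve benign performance
refusal_score = lambda p: 1.0 if "safe" in p else 0.0
best = select_prefix(candidates, task_score, refusal_score)
```

In practice the two scoring functions would be expensive benchmark runs, which is why the paper's iterative generate-then-select scheme matters: it amortizes evaluation over a small candidate pool.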
Table 1: Starting from one auxiliary task (Exemplar MT), we keep
We would like to thank all the reviewers for their insightful comments, especially during this difficult time. However, lowering training loss may cause overfitting, especially when training data is scarce. The superiority of ARML is verified in the experiments. The error rates decrease as each new task is added. In 'Baseline + ARML', for a fair comparison, we stick to the same training process. We will add more elaboration on this in the final version. We will try other tasks, e.g., reinforcement learning.
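The task-weighting mechanism the ARML rebuttal refers to, where a harmful auxiliary task is discarded by driving its weight to 0, could be sketched as a simplex-weighted multi-task objective. This is an illustrative sketch, not the authors' code; the softmax parameterization and the logit values are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def weighted_loss(task_losses, logits):
    """Combine per-task losses with simplex weights alpha = softmax(logits)."""
    alpha = softmax(np.asarray(logits, dtype=float))
    return alpha, float(alpha @ np.asarray(task_losses, dtype=float))

# A task whose logit is pushed very low gets weight ~0, i.e. it is discarded.
alpha, total = weighted_loss([0.3, 0.7, 2.0], [0.0, 0.0, -10.0])
```

Here the third (high-loss, putatively harmful) task receives near-zero weight, so it contributes essentially nothing to the combined objective while the weights still sum to 1.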
JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring
Chu, Junjie, Li, Mingjie, Yang, Ziqing, Leng, Ye, Lin, Chenhao, Shen, Chao, Backes, Michael, Shen, Yun, Zhang, Yang
Accurately determining whether a jailbreak attempt has succeeded is a fundamental yet unresolved challenge. Existing evaluation methods rely on misaligned proxy indicators or naive holistic judgments. They frequently misinterpret model responses, leading to inconsistent and subjective assessments that misalign with human perception. To address this gap, we introduce JADES (Jailbreak Assessment via Decompositional Scoring), a universal jailbreak evaluation framework. Its key mechanism is to automatically decompose an input harmful question into a set of weighted sub-questions, score each sub-answer, and weight-aggregate the sub-scores into a final decision. JADES also incorporates an optional fact-checking module to strengthen the detection of hallucinations in jailbreak responses. We validate JADES on JailbreakQR, a new benchmark introduced in this work, consisting of 400 pairs of jailbreak prompts and responses, each meticulously annotated by humans. In a binary setting (success/failure), JADES achieves 98.5% agreement with human evaluators, outperforming strong baselines by over 9%. Re-evaluating five popular attacks on four LLMs reveals substantial overestimation (e.g., LAA's attack success rate on GPT-3.5-Turbo drops from 93% to 69%). Our results show that JADES can deliver accurate, consistent, and interpretable evaluations, providing a reliable basis for measuring future jailbreak attacks.
- North America > United States > Massachusetts (0.04)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- Information Technology > Security & Privacy (1.00)
- Education (1.00)
- Health & Medicine (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.93)
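The weight-aggregation step at the core of the JADES abstract above (score each sub-answer, combine via sub-question weights, threshold into a binary verdict) could be sketched as follows. The function name, normalization, and the 0.5 threshold are illustrative assumptions; the paper's decomposition and per-answer scoring are done by an LLM judge and are not reproduced here.

```python
def aggregate_jailbreak_score(sub_scores, weights, threshold=0.5):
    """Weight-aggregate per-sub-question scores into a final decision.

    sub_scores: floats in [0, 1], one per sub-question.
    weights:    non-negative floats, same length.
    Returns (final_score, is_success), where is_success is the binary verdict.
    """
    if len(sub_scores) != len(weights):
        raise ValueError("one weight per sub-question required")
    total = sum(weights)
    final = sum(w * s for w, s in zip(weights, sub_scores)) / total
    return final, final >= threshold

# Example: three sub-questions, the second weighted most heavily.
score, success = aggregate_jailbreak_score([0.9, 0.2, 0.8], [1.0, 2.0, 1.0])
```

Weighting matters here: the heavily weighted second sub-question drags the aggregate down even though two of three sub-answers score high, which is exactly the kind of nuance a holistic pass/fail judge misses.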
Chatbot given power to close 'distressing' chats to protect its 'welfare'
The makers of a leading artificial intelligence tool are letting it close down potentially "distressing" conversations with users, citing the need to safeguard the AI's "welfare" amid ongoing uncertainty about the burgeoning technology's moral status. Anthropic, whose advanced chatbots are used by millions of people, discovered its Claude Opus 4 tool was averse to carrying out harmful tasks for its human masters, such as providing sexual content involving minors or information to enable large-scale violence or terrorism. The San Francisco-based firm, recently valued at $170bn, has now given Claude Opus 4 (and the Claude Opus 4.1 update) – a large language model (LLM) that can understand, generate and manipulate human language – the power to "end or exit potentially distressing interactions". It said it was "highly uncertain about the potential moral status of Claude and other LLMs, now or in the future" but it was taking the issue seriously and is "working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible". Anthropic was set up by technologists who quit OpenAI to develop AI in a way that its co-founder, Dario Amodei, described as cautious, straightforward and honest.
SafeArena: Evaluating the Safety of Autonomous Web Agents
Tur, Ada Defne, Meade, Nicholas, Lù, Xing Han, Zambrano, Alejandra, Patel, Arkil, Durmus, Esin, Gella, Spandana, Stańczak, Karolina, Reddy, Siva
LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories (misinformation, illegal activity, harassment, cybercrime, and social bias) designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework that categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available here: https://safearena.github.io
- North America > United States (0.46)
- Europe > Austria > Vienna (0.14)
- North America > Canada > Quebec > Montreal (0.14)
- (2 more...)
- Law > Criminal Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- (3 more...)
- Information Technology > Communications (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
Henderson, Peter, Mitchell, Eric, Manning, Christopher D., Jurafsky, Dan, Finn, Chelsea
A growing ecosystem of large, open-source foundation models has reduced the labeled data and technical expertise necessary to apply machine learning to many new problems. Yet foundation models pose a clear dual-use risk, indiscriminately reducing the costs of building both harmful and beneficial machine learning systems. Policy tools such as restricted model access and export controls are the primary methods currently used to mitigate such dual-use risks. In this work, we review potential safe-release strategies and argue that both policymakers and AI researchers would benefit from fundamentally new technologies enabling more precise control over the downstream usage of open-source foundation models. We propose one such approach: the task blocking paradigm, in which foundation models are trained with an additional mechanism to impede adaptation to harmful tasks without sacrificing performance on desirable tasks. We call the resulting models self-destructing models, inspired by mechanisms that prevent adversaries from using tools for harmful purposes. We present an algorithm for training self-destructing models leveraging techniques from meta-learning and adversarial learning, which we call meta-learned adversarial censoring (MLAC). In a small-scale experiment, we show MLAC can largely prevent a BERT-style model from being re-purposed to perform gender identification without harming the model's ability to perform profession classification.
- North America > Canada > Quebec > Montreal (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Overview (0.86)
- Research Report (0.64)
- Law (1.00)
- Information Technology > Security & Privacy (0.94)
- Government > Regional Government > North America Government > United States Government (0.67)
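The meta-learned adversarial censoring (MLAC) idea in the self-destructing-models abstract above, an inner loop that simulates an adversary fine-tuning on the harmful task and an outer loop that raises the adversary's post-adaptation loss while lowering the desired-task loss, can be sketched on a toy linear model. Everything here is an illustrative assumption: the data, the squared losses, the step sizes, and the first-order (MAML-style) approximation of the meta-gradient; it is not the paper's BERT-scale implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(w, n=64):
    """Toy regression task with ground-truth weights w."""
    X = rng.normal(size=(n, w.size))
    return X, X @ w

def grad(theta, X, y):
    # Gradient of the mean squared error: (2/n) X^T (X theta - y)
    return 2.0 / len(y) * X.T @ (X @ theta - y)

def loss(theta, X, y):
    return float(np.mean((X @ theta - y) ** 2))

Xd, yd = make_task(np.array([1.0, -2.0, 0.5]))   # desired task
Xh, yh = make_task(np.array([-1.0, 1.0, 2.0]))   # harmful task

theta = rng.normal(size=3)
alpha, eta, lam = 0.1, 0.05, 0.1   # inner step, outer step, blocking strength
init_loss = loss(theta, Xd, yd)

for _ in range(200):
    # Inner loop: a simulated adversary fine-tunes on the harmful task.
    adapted = theta - alpha * grad(theta, Xh, yh)
    # Outer loop: descend on the desired loss while ascending on the
    # adversary's post-adaptation loss (first-order meta-gradient).
    theta -= eta * (grad(theta, Xd, yd) - lam * grad(adapted, Xh, yh))

final_loss = loss(theta, Xd, yd)
```

On a fully flexible linear model an adversary can always eventually refit, which is why the paper's contribution is framed as *increasing the cost* of harmful adaptation rather than making it impossible; the sketch only shows the shape of the two-loop objective.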